Proactive Fault Monitoring in Enterprise Servers

Authors

  • Keith Whisnant
  • Kenny C. Gross
  • Natasha Lingurovska
Abstract

New proactive fault-monitoring techniques are being developed, demonstrated on running servers, and productized to improve the reliability, availability, and serviceability of enterprise-class servers. A continuous system telemetry harness (CSTH) collects time series signals relating to the health of dynamically executing servers. These time series provide quantitative metrics for physical variables (distributed temperatures, voltages, and currents throughout the system), "soft" performance variables (loads, throughputs, queue lengths, bit error rates, etc.), and various quality-of-service (QoS) metrics. The CSTH signals are continuously archived to an offline circular file (the "Black Box Flight Recorder"), which is helping to identify and eliminate costly sources of No-Trouble-Found (NTF) events in Sun systems. The same signals are concurrently processed in real time using advanced pattern recognition for proactive anomaly detection. Examples are presented of the CSTH coupled with pattern recognition for high-sensitivity predictive failure analysis, which is helping to increase component and system availability while reducing the incidence of NTF events, a costly serviceability and warranty issue in the enterprise computing industry.
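The two roles the abstract assigns to the telemetry signals (a fixed-size circular archive and concurrent anomaly detection) can be illustrated with a minimal Python sketch. Everything here is an assumption made for illustration: the class name `TelemetryRecorder`, its parameters, and the sample readings are hypothetical, an in-memory `deque` stands in for the offline circular file, and a rolling z-score check stands in for the "advanced pattern recognition" that the abstract mentions but does not specify.

```python
from collections import deque
from statistics import mean, stdev


class TelemetryRecorder:
    """Hypothetical sketch: circular archive plus a rolling z-score check."""

    def __init__(self, capacity, window, threshold=4.0):
        # Circular archive: once full, the oldest sample is overwritten.
        self.archive = deque(maxlen=capacity)
        # Recent non-anomalous samples used as the baseline.
        self.window = deque(maxlen=window)
        self.threshold = threshold

    def record(self, timestamp, value):
        """Archive one sample; return True if it deviates from the baseline."""
        anomalous = False
        if len(self.window) == self.window.maxlen:
            mu = mean(self.window)
            sigma = stdev(self.window)
            # Flag samples more than `threshold` standard deviations away.
            if sigma > 0 and abs(value - mu) / sigma > self.threshold:
                anomalous = True
        self.archive.append((timestamp, value, anomalous))
        if not anomalous:
            # Keep flagged samples out of the baseline so one spike
            # does not mask the next one.
            self.window.append(value)
        return anomalous


# Illustrative temperature readings (made-up values, one spike at index 5).
recorder = TelemetryRecorder(capacity=5, window=4)
readings = [40.0, 40.5, 39.5, 40.0, 40.2, 55.0, 40.1]
flags = [recorder.record(t, v) for t, v in enumerate(readings)]
```

With these readings only the spike is flagged, and the archive retains just the five most recent samples, so earlier data is discarded in the same way a circular file would discard it. A production harness would persist the archive to disk and monitor many correlated signals at once; this sketch handles a single signal in memory.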

Similar resources

An Architecture for an Adaptive Intrusion-Tolerant Server

We describe a general architecture for intrusion-tolerant enterprise systems and the implementation of an intrusion-tolerant Web server as a specific instance. The architecture comprises functionally redundant COTS servers running on diverse operating systems and platforms, hardened intrusion-tolerance proxies that mediate client requests and verify the behavior of servers and other proxies, an...

Improved Methods for Early Fault Detection in Enterprise Computing Servers Using SAS Tools

Advanced telemetry systems are being developed to collect and archive hundreds of system performance, throughput, quality-of-service (QoS), and physical variables for the purpose of enhancing the reliability, availability, serviceability, scalability, and security of business-critical enterprise computing servers. SAS software was chosen for this project because of the language's powerful codin...

A High-Performance and Fault-Tolerant Flow Control Method for Enterprise Servers

Network routers for parallel enterprise servers need fault tolerance as well as high performance to support a seamless value chain of e-business. This paper introduces a new cut-through flow control method, called the pathfinder, which provides an efficient restarting capability without the extra header delivery overhead for normal non-faulty routes. We present the router architecture and the de...

A proactive fault tolerance framework for high performance computing (HPC) systems in the cloud

As high-performance computing (HPC) systems continue to increase in scale, their mean time to interrupt decreases correspondingly. The current state of practice for fault tolerance (FT) is checkpoint/restart. However, with increasing error rates, increasing aggregate memory, and I/O capabilities that do not increase proportionally, it is becoming less efficient. Proactive FT avoids experiencing failures ...

A request-routing framework for SOA-based enterprise computing

Enterprises may use a service-oriented architecture (SOA) to provide a streamlined interface to their business processes. To scale up the system, each tier in a composite service usually deploys multiple servers for load distribution and fault tolerance. Such load distribution across multiple servers within the same tier can be viewed as horizontal load distribution. One limitation of this appr...

Publication date: 2005